# if you have not used plotly before, uncomment this cell and run once
# !pip install plotly
# Set up notebook to display multiple outputs in one cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
If you go back and look at the Module 5 code, Year was saved as type object, but below it is read in by default as an int64. The Year column will not be used for mathematical operations in our analysis; it serves as a label in plots, so it is better stored as an object. We could convert the variable after the pandas read, or we can convert it as we read in the data.
# read in happiness scores
#merged = pd.read_csv('Happiness_clean.csv') # will read in Year as int64
merged = pd.read_csv('Happiness_clean.csv', dtype = {'Year': str})
merged.info()
merged.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 566 entries, 0 to 565
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Country                566 non-null    object
 1   Year                   566 non-null    object
 2   Life Ladder            566 non-null    float64
 3   Log GDP                566 non-null    float64
 4   Social support         566 non-null    float64
 5   Life Expectancy        566 non-null    float64
 6   Choice Freedom         566 non-null    float64
 7   Generosity             566 non-null    float64
 8   Corruption             566 non-null    float64
 9   Positive affect        566 non-null    float64
 10  Negative affect        566 non-null    float64
 11  Government confidence  566 non-null    float64
 12  Population             566 non-null    int64
dtypes: float64(10), int64(1), object(2)
memory usage: 57.6+ KB
| | Country | Year | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | 3.982855 | 7.634466 | 0.528597 | 52.599998 | 0.388928 | 0.085082 | 0.880638 | 0.491410 | 0.339276 | 0.260557 | 34413603 |
| 1 | Afghanistan | 2016 | 4.220169 | 7.629037 | 0.559072 | 52.924999 | 0.522566 | 0.047488 | 0.793246 | 0.501409 | 0.348332 | 0.324990 | 35383028 |
| 2 | Afghanistan | 2017 | 2.661718 | 7.629684 | 0.490880 | 53.250000 | 0.427011 | -0.116068 | 0.954393 | 0.435270 | 0.371326 | 0.261179 | 36296111 |
| 3 | Afghanistan | 2018 | 2.694303 | 7.617663 | 0.507516 | 53.575001 | 0.373536 | -0.088125 | 0.927606 | 0.384561 | 0.404904 | 0.364666 | 37171922 |
| 4 | Afghanistan | 2019 | 2.375092 | 7.632903 | 0.419973 | 53.900002 | 0.393656 | -0.103467 | 0.923849 | 0.324108 | 0.502474 | 0.341482 | 38041757 |
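The other option mentioned above, converting Year after the read, can be sketched on a small in-memory CSV. The CSV text and the variable names here are made up for illustration; only `read_csv`, `dtype` and `astype` are the actual pandas calls.

```python
import io
import pandas as pd

# Tiny stand-in for Happiness_clean.csv (hypothetical values)
csv_text = "Country,Year,Life Ladder\nFinland,2015,7.45\nFinland,2016,7.66\n"

# Option 1: read with defaults, then convert the column afterwards
df_after = pd.read_csv(io.StringIO(csv_text))
default_dtype = df_after['Year'].dtype          # integer type by default
df_after['Year'] = df_after['Year'].astype(str)

# Option 2: convert while reading, as done in the cell above
df_on_read = pd.read_csv(io.StringIO(csv_text), dtype={'Year': str})

print(default_dtype)
print(df_after['Year'].tolist())
print(df_on_read['Year'].tolist())
```

Both options end with the same string labels; converting on read simply saves a step.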
The Seaborn pairplot will show a histogram of each variable and a scatterplot of each combination of variables. These views allow you to see the distribution of each individual field and also how each field interacts with the other fields.
In the example below, we are showing Life Ladder, Log GDP and Corruption. There are a few things of note in the plots:
# pairplots on three fields
columns = ['Life Ladder','Log GDP','Corruption']
columns
sns.pairplot(merged[columns])
['Life Ladder', 'Log GDP', 'Corruption']
<seaborn.axisgrid.PairGrid at 0x20888c33730>
import scipy
from scipy.stats import skew
print(skew(merged['Corruption'], axis = 0, bias = True))
merged['Corruption'].skew()
-1.45433598779675
-1.4582033326732935
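The two values printed above differ slightly because `scipy.stats.skew` with `bias = True` returns the uncorrected sample skewness, while pandas' `.skew()` applies a bias correction. The sketch below, on randomly generated data rather than the happiness data, shows that setting `bias = False` in scipy reproduces the pandas value.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic, right-skewed sample (not the happiness data)
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(size=200))

biased = skew(values, bias=True)       # uncorrected sample skewness
unbiased = skew(values, bias=False)    # bias-corrected skewness
pandas_skew = values.skew()            # pandas applies the same correction

print('scipy, bias=True :', biased)
print('scipy, bias=False:', unbiased)
print('pandas .skew()   :', pandas_skew)
```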
Data is considered to be skewed when the skew value is at or beyond +3 or -3. Even though Corruption only has a skew of -1.45, the sample code below tests two functions that can be used to reduce skew: NumPy's log and square functions.
- log is typically used for positively skewed data
- square is typically used for negatively skewed data
# testing out log on negatively skewed data and it did not improve the skew
print('------- before log --------------')
merged['Corruption'].skew()
test = np.log(merged['Corruption'])
print('------- after log ---------------')
test.skew()
#test
------- before log --------------
-1.4582033326732935
------- after log ---------------
-2.6417261913936647
Using NumPy's log function did not improve the skew value. In fact, the skew moved further from 0 in the negative direction. As you can see with the square function, the skew value improved and got closer to 0.
# testing out square on negatively skewed data and it did improve skew
print('------- before square --------------')
merged['Corruption'].skew()
print('------- after square ---------------')
test = np.square(merged['Corruption'])
test.skew()
#test
------- before square --------------
-1.4582033326732935
------- after square ---------------
-0.8623948258485358
If you wanted to add the adjusted field to your data, you could use code like what is shown below.
'''merged['Corruption'].skew()
merged['Corruption_skew'] = np.square(merged['Corruption'])
merged['Corruption_skew'].skew()'''
"merged['Corruption'].skew()\nmerged['Corruption_skew'] = np.square(merged['Corruption'])\nmerged['Corruption_skew'].skew()"
#TODO create pairplot
cols = ['Life Expectancy', 'Log GDP', 'Corruption', 'Social support', 'Choice Freedom', 'Generosity', 'Population']
sns.pairplot(merged[cols])
<seaborn.axisgrid.PairGrid at 0x2088fbac520>
The plot of 'Log GDP' against 'Life Expectancy' shows a strong positive relationship. As the Log GDP increases, the Life Expectancy also tends to increase.
There are five outliers in the Population pairplots. These could be countries with the highest populations.
We could start by drawing a box plot for Population. We then calculate the first quartile (Q1), third quartile (Q3), and interquartile range (IQR) for the 'Population' data, and define an outlier threshold as Q3 + 1.5 * IQR, following the common definition of an outlier in a box-and-whisker plot.
Finally, we identify and print out the potential outliers, which are those countries with a population greater than our outlier threshold.
If the outliers correspond to countries with the highest populations, this could suggest that these outliers are not errors or anomalies, but are simply extreme data points. In such cases, the decision of how to handle these outliers would depend on the specific context and purpose of the analysis.
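The IQR steps described above can be sketched as follows. The DataFrame `toy` and its country labels and population values are hypothetical; in the notebook the same logic would be applied to `merged` (or to a single year of it).

```python
import pandas as pd

# Hypothetical populations in millions, one row per country
toy = pd.DataFrame({
    'Country': list('ABCDEFGHIJ'),
    'Population': [5, 8, 10, 12, 15, 20, 35, 80, 320, 1300],
})

# Quartiles and interquartile range
q1 = toy['Population'].quantile(0.25)
q3 = toy['Population'].quantile(0.75)
iqr = q3 - q1

# Upper fence used by a box-and-whisker plot
upper = q3 + 1.5 * iqr

# Rows whose population lies above the fence
outliers = toy[toy['Population'] > upper]
print('Q1:', q1, 'Q3:', q3, 'IQR:', iqr, 'threshold:', upper)
print(outliers)
```

Only the upper fence is checked here because the text is concerned with very large populations; a lower fence of Q1 - 1.5 * IQR would be handled the same way.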
#TODO: Display the population skew
print('Population Skew: {:0.5f}'.format(merged['Population'].skew()))
#TODO: Determine the skew value after a log transformation
log_pop = np.log(merged['Population'])
print('Population Skew after log transformation: {:0.5f}'.format(log_pop.skew()))
#TODO: Display the skew value after a square transformation
sq_pop = np.square(merged['Population'])
print('Population Skew after square transformation: {:0.5f}'.format(sq_pop.skew()))
Population Skew: 8.24050
Population Skew after log transformation: 0.21209
Population Skew after square transformation: 10.43757
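The pattern in the output above, where log pulls a strong positive skew toward 0 and square pushes it further away, can be reproduced on synthetic data. This is a sketch on randomly generated log-normal values, not the actual Population column. Note that `np.log` needs strictly positive inputs.

```python
import numpy as np
import pandas as pd

# Synthetic, strongly right-skewed values (stand-in for a population column)
rng = np.random.default_rng(42)
population = pd.Series(rng.lognormal(mean=15, sigma=1.0, size=1000))

raw_skew = population.skew()
log_skew = np.log(population).skew()      # log compresses the long right tail
sq_skew = np.square(population).skew()    # square stretches the right tail further

print('raw   : {:0.5f}'.format(raw_skew))
print('log   : {:0.5f}'.format(log_skew))
print('square: {:0.5f}'.format(sq_skew))
```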
We will create a correlation matrix and Seaborn heatmap to investigate how happiness measures and population are related, if at all. More information on creating a heatmap using Seaborn can be found here: Seaborn heatmap
What does the correlation matrix tell us?
- Log GDP and Life Ladder which suggests that those countries with a strong GDP have citizens with a better life.
- Corruption and Life Ladder which suggests the less corruption, the better for a country.
- Corruption and Log GDP which suggests that the less corruption, the better the GDP.
# set up files for correlations
columns = ['Life Ladder','Log GDP', 'Corruption']
df_corr = merged[columns]
# creates a correlation matrix
corrmat = df_corr.corr()
corrmat
# heatmap of correlation matrix
f, ax = plt.subplots(figsize = (4, 4))
sns.heatmap(corrmat, vmax = .8, square = True, annot = True, cmap = 'RdYlBu', linewidths = .5 )
| | Life Ladder | Log GDP | Corruption |
|---|---|---|---|
| Life Ladder | 1.000000 | 0.803193 | -0.476317 |
| Log GDP | 0.803193 | 1.000000 | -0.374004 |
| Corruption | -0.476317 | -0.374004 | 1.000000 |
<Axes: >
# TODO: Create a heat map using all eleven numeric variables and use a new color combination.
# set-up files for correlations
num_data = merged.select_dtypes(include=[np.number])
num_data.shape
num_cols = num_data.columns
# create the correlation matrix
corr_mat = num_data.corr()
corr_mat
# create the heatmap for correlation matrix
f, ax = plt.subplots(figsize = (16, 16))
sns.heatmap(corr_mat, vmax=0.8, square=True, annot=True, cmap='PiYG', linewidths=0.5)
(566, 11)
| | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Life Ladder | 1.000000 | 0.803193 | 0.764924 | 0.783197 | 0.547298 | 0.102491 | -0.476317 | 0.476297 | -0.468102 | -0.135352 | -0.104939 |
| Log GDP | 0.803193 | 1.000000 | 0.790056 | 0.878006 | 0.375866 | -0.067223 | -0.374004 | 0.243472 | -0.538274 | -0.262434 | -0.059433 |
| Social support | 0.764924 | 0.790056 | 1.000000 | 0.709735 | 0.421265 | 0.002527 | -0.278219 | 0.390959 | -0.611946 | -0.190980 | -0.148190 |
| Life Expectancy | 0.783197 | 0.878006 | 0.709735 | 1.000000 | 0.378678 | -0.020898 | -0.340850 | 0.222021 | -0.441006 | -0.275906 | -0.068245 |
| Choice Freedom | 0.547298 | 0.375866 | 0.421265 | 0.378678 | 1.000000 | 0.275339 | -0.507638 | 0.596388 | -0.361122 | 0.431481 | 0.075748 |
| Generosity | 0.102491 | -0.067223 | 0.002527 | -0.020898 | 0.275339 | 1.000000 | -0.321057 | 0.246817 | -0.052446 | 0.410747 | 0.109213 |
| Corruption | -0.476317 | -0.374004 | -0.278219 | -0.340850 | -0.507638 | -0.321057 | 1.000000 | -0.337151 | 0.397675 | -0.464388 | 0.047736 |
| Positive affect | 0.476297 | 0.243472 | 0.390959 | 0.222021 | 0.596388 | 0.246817 | -0.337151 | 1.000000 | -0.307170 | 0.173050 | -0.017792 |
| Negative affect | -0.468102 | -0.538274 | -0.611946 | -0.441006 | -0.361122 | -0.052446 | 0.397675 | -0.307170 | 1.000000 | -0.057732 | 0.080105 |
| Government confidence | -0.135352 | -0.262434 | -0.190980 | -0.275906 | 0.431481 | 0.410747 | -0.464388 | 0.173050 | -0.057732 | 1.000000 | 0.135683 |
| Population | -0.104939 | -0.059433 | -0.148190 | -0.068245 | 0.075748 | 0.109213 | 0.047736 | -0.017792 | 0.080105 | 0.135683 | 1.000000 |
<Axes: >
The correlation data indicates a slight negative association between population size and variables like Life Ladder, Log GDP, Social Support, Life Expectancy, and Positive Affect. This hints at a possible decline in these parameters as population increases, although these relationships are very weak and thus not substantial. Likewise, although there are positive correlations between Population and Government Confidence, Negative Affect, and Generosity, these relationships are also rather weak, underlining their marginal influence.
Life Expectancy is strongly and positively associated with Log GDP (0.88), Life Ladder (0.78), and Social support (0.71), and more weakly with Choice Freedom (0.38) and Positive affect (0.22), indicating that an increase in life expectancy often aligns with an increase in these measures. On the other hand, Life Expectancy is negatively correlated with Negative affect (-0.44), Corruption (-0.34), and Government confidence (-0.28), while its correlations with Population (-0.07) and Generosity (-0.02) are close to zero.
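Reading the strongest relationships off a large correlation matrix by eye is error-prone. One possible helper, sketched here with a hypothetical function name `top_correlations` and toy data, keeps each pair of variables once (the upper triangle of the matrix) and sorts the pairs by absolute correlation.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """Return the n variable pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include=[np.number]).corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10.5],
    'c': [5, 3, 4, 1, 2],
})
result = top_correlations(toy, n=3)
print(result)
```

In the notebook, `top_correlations(merged)` would list the strongest pairs among the eleven numeric variables without scanning the heatmap cell by cell.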
One way to investigate data is to pull a smaller subset of data so that it is easy to inspect each row and column.
fin = merged[merged['Country'] == 'Finland']
fin
| | Country | Year | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 169 | Finland | 2015 | 7.447926 | 10.716029 | 0.947801 | 70.699997 | 0.929862 | 0.108773 | 0.223370 | 0.736426 | 0.191058 | 0.557600 | 5479531 |
| 170 | Finland | 2016 | 7.659843 | 10.740882 | 0.953940 | 70.775002 | 0.948372 | -0.029360 | 0.249660 | 0.768806 | 0.181998 | 0.485727 | 5495303 |
| 171 | Finland | 2017 | 7.788252 | 10.769960 | 0.963826 | 70.849998 | 0.962199 | -0.004811 | 0.192413 | 0.755858 | 0.176066 | 0.597539 | 5508214 |
| 172 | Finland | 2018 | 7.858107 | 10.779988 | 0.962155 | 70.925003 | 0.937807 | -0.129722 | 0.198605 | 0.748826 | 0.181781 | 0.555102 | 5515525 |
| 173 | Finland | 2019 | 7.780348 | 10.792235 | 0.937416 | 71.000000 | 0.947617 | -0.054119 | 0.195338 | 0.732282 | 0.180733 | 0.639188 | 5521606 |
The heatmap allows us to see which variables have the highest values within the top 20 countries based on Life Ladder scores for 2015.
# 2015 only data
year2015 = merged[merged['Year'] == '2015']
# top 20 of 2015 based on life ladder
t2015 = year2015.sort_values(by = ['Life Ladder'], ascending = False)[:20]
t2015.style.background_gradient(cmap = 'Greens')
| | Country | Year | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 404 | Norway | 2015 | 7.603434 | 11.050693 | 0.946834 | 71.199997 | 0.947621 | 0.252847 | 0.298814 | 0.796321 | 0.209410 | 0.586872 | 5188607 |
| 498 | Switzerland | 2015 | 7.572137 | 11.127645 | 0.938334 | 71.699997 | 0.927802 | 0.104062 | 0.209534 | 0.794054 | 0.165759 | 0.787730 | 8282396 |
| 137 | Denmark | 2015 | 7.514425 | 10.876019 | 0.959701 | 70.500000 | 0.941436 | 0.219611 | 0.191016 | 0.801433 | 0.217578 | 0.579889 | 5683483 |
| 228 | Iceland | 2015 | 7.498071 | 10.861744 | 0.980283 | 71.900002 | 0.940485 | 0.297560 | 0.638662 | 0.794291 | 0.179504 | 0.427228 | 330815 |
| 169 | Finland | 2015 | 7.447926 | 10.716029 | 0.947801 | 70.699997 | 0.929862 | 0.108773 | 0.223370 | 0.736426 | 0.191058 | 0.557600 | 5479531 |
| 379 | New Zealand | 2015 | 7.418121 | 10.623425 | 0.987343 | 69.900002 | 0.941784 | 0.325948 | 0.185889 | 0.794508 | 0.159830 | 0.620908 | 4609400 |
| 94 | Canada | 2015 | 7.412773 | 10.768951 | 0.939067 | 71.099998 | 0.931469 | 0.250651 | 0.427152 | 0.791709 | 0.286280 | 0.644104 | 35702908 |
| 374 | Netherlands | 2015 | 7.324437 | 10.877559 | 0.879010 | 71.099998 | 0.903979 | 0.259138 | 0.411822 | 0.742388 | 0.202129 | 0.579621 | 16939923 |
| 20 | Australia | 2015 | 7.309061 | 10.769942 | 0.951862 | 70.599998 | 0.921871 | 0.330029 | 0.356554 | 0.749504 | 0.209637 | 0.478557 | 23815995 |
| 493 | Sweden | 2015 | 7.288922 | 10.838187 | 0.929460 | 71.400002 | 0.935072 | 0.209648 | 0.231964 | 0.766198 | 0.190992 | 0.499302 | 9799186 |
| 250 | Israel | 2015 | 7.079411 | 10.526913 | 0.864130 | 71.800003 | 0.752784 | 0.106784 | 0.789430 | 0.651632 | 0.256258 | 0.405343 | 8380100 |
| 25 | Austria | 2015 | 7.076447 | 10.875665 | 0.928110 | 70.400002 | 0.900305 | 0.096642 | 0.557480 | 0.747708 | 0.164469 | 0.454790 | 8642699 |
| 189 | Germany | 2015 | 7.037138 | 10.842699 | 0.925923 | 70.099998 | 0.889429 | 0.175081 | 0.412168 | 0.722385 | 0.202705 | 0.628004 | 81686611 |
| 45 | Belgium | 2015 | 6.904219 | 10.808846 | 0.885209 | 70.000000 | 0.869475 | 0.059686 | 0.468785 | 0.747103 | 0.239959 | 0.459024 | 11274196 |
| 542 | United States | 2015 | 6.863947 | 10.977470 | 0.903571 | 66.599998 | 0.848753 | 0.216716 | 0.697543 | 0.768671 | 0.274688 | 0.346936 | 320738994 |
| 118 | Costa Rica | 2015 | 6.854004 | 9.859624 | 0.878273 | 70.000000 | 0.906926 | -0.065053 | 0.761419 | 0.810668 | 0.286440 | 0.261169 | 4847805 |
| 245 | Ireland | 2015 | 6.830125 | 11.177575 | 0.952943 | 70.699997 | 0.892277 | 0.229410 | 0.408757 | 0.748266 | 0.225349 | 0.571841 | 4701957 |
| 300 | Luxembourg | 2015 | 6.701571 | 11.636759 | 0.933605 | 71.500000 | 0.932256 | 0.047323 | 0.375390 | 0.727878 | 0.193050 | 0.694678 | 569604 |
| 10 | Argentina | 2015 | 6.697131 | 10.083051 | 0.926492 | 66.900002 | 0.881224 | -0.175746 | 0.850906 | 0.767845 | 0.305355 | 0.378169 | 43131966 |
| 547 | Uruguay | 2015 | 6.628080 | 10.017969 | 0.891493 | 67.500000 | 0.916880 | -0.040206 | 0.673476 | 0.811902 | 0.299538 | 0.550559 | 3412013 |
#TODO: Isolate one country - your choice of country
nor = merged[merged['Country'] == 'Norway']
#TODO: Create a heatmap of your country with all measures
nor_sort = nor.sort_values(by = ['Year'], ascending = False)
nor_sort.style.background_gradient(cmap = 'Blues')
| | Country | Year | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 408 | Norway | 2019 | 7.442140 | 11.073689 | 0.941784 | 71.400002 | 0.954044 | 0.106946 | 0.270572 | 0.781727 | 0.195487 | 0.597987 | 5347896 |
| 407 | Norway | 2018 | 7.444262 | 11.071957 | 0.965962 | 71.349998 | 0.960429 | 0.090105 | 0.268201 | 0.785607 | 0.211862 | 0.679503 | 5311916 |
| 406 | Norway | 2017 | 7.578745 | 11.067431 | 0.950128 | 71.300003 | 0.953017 | 0.232320 | 0.249711 | 0.800106 | 0.202914 | 0.717160 | 5276968 |
| 405 | Norway | 2016 | 7.596332 | 11.052541 | 0.959743 | 71.250000 | 0.954352 | 0.128807 | 0.409666 | 0.809430 | 0.209262 | 0.657646 | 5234519 |
| 404 | Norway | 2015 | 7.603434 | 11.050693 | 0.946834 | 71.199997 | 0.947621 | 0.252847 | 0.298814 | 0.796321 | 0.209410 | 0.586872 | 5188607 |
In the example below, we select a subset of the 2015 data with four columns: Country, Life Ladder, Log GDP and Social support. Note that we create two DataFrames, one for the bottom 20 countries of 2015 based on Life Ladder and one for all of 2015. We will use these DataFrames for the plotting examples.
# isolate four fields for the bottom 20 of 2015
b2015 = year2015.sort_values(by = ['Life Ladder'], ascending = True)[:20]
subset2015b = b2015[['Country', 'Life Ladder', 'Log GDP', 'Social support']]
# isolate four fields for all of 2015
subset2015all = year2015[['Country', 'Life Ladder', 'Log GDP', 'Social support']]
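You can change the style of your plots, but once you set the style with plt.style.use, it stays set for every later plot in the notebook, including plots made with Seaborn.
# list the available styles (the seaborn styles are the 'seaborn-v0_8' entries)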
print(plt.style.available)
['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-v0_8', 'seaborn-v0_8-bright', 'seaborn-v0_8-colorblind', 'seaborn-v0_8-dark', 'seaborn-v0_8-dark-palette', 'seaborn-v0_8-darkgrid', 'seaborn-v0_8-deep', 'seaborn-v0_8-muted', 'seaborn-v0_8-notebook', 'seaborn-v0_8-paper', 'seaborn-v0_8-pastel', 'seaborn-v0_8-poster', 'seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-white', 'seaborn-v0_8-whitegrid', 'tableau-colorblind10']
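If a style should apply to a single plot only, matplotlib also provides `plt.style.context`, which applies the style inside a `with` block and restores the previous settings afterwards. The sketch below uses made-up bar values; the `before`, `inside` and `after` variables are only there to show that the setting is restored.

```python
import matplotlib.pyplot as plt

# Axes background colour under the current style
before = plt.rcParams['axes.facecolor']

# The ggplot style is active only inside this block
with plt.style.context('ggplot'):
    inside = plt.rcParams['axes.facecolor']
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(['A', 'B', 'C'], [3, 5, 2])   # hypothetical values
    ax.set_title('ggplot style, this plot only')

# Outside the block the earlier style is back in effect
after = plt.rcParams['axes.facecolor']
print(before, inside, after)
```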
This stacked barchart below is using the data from the bottom 20 countries for 2015.
# example using the seaborn style
# 'seaborn' is deprecated as a matplotlib style name since 3.6; use the 'seaborn-v0_8' name listed above
plt.style.use('seaborn-v0_8')
subset2015b.plot(x = 'Country', kind = 'bar', stacked = True, title = 'Happiness Stacked Bar Chart 2015',
                 ylabel = 'Happiness Scores')
plt.show();
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize = (6, 25))
subset2015all = subset2015all.sort_values(by = 'Life Ladder')
ax.barh(subset2015all['Country'], subset2015all['Life Ladder'], align = 'center')
ax.set(xlabel = 'Life Ladder', ylabel = 'Country',
title = 'Happiness Life Ladder 2015');
plt.style.use('dark_background')
fig, ax = plt.subplots(figsize = (8, 6))
x = subset2015b['Country']
y1 = subset2015b['Life Ladder']
y2 = subset2015b['Log GDP']
y3 = subset2015b['Social support']
plt.bar(x, y1, color = 'r');
plt.bar(x, y2, bottom = y1, color = 'b');
plt.bar(x, y3, bottom = y1+y2, color = 'g');
plt.style.use('ggplot')
subset2015sorted = subset2015b.sort_values(by = 'Log GDP', ascending = True)
subset2015sorted.plot(x = 'Country', kind = 'barh', stacked = True, title = '2015 Happiness')
plt.xlabel('Category scores');
plt.show();
The plotly bubble chart below shows Life Expectancy, Life Ladder and Population. Life Expectancy is on the x-axis, Life Ladder is on the y-axis, and the size of each circle represents the size of the population. The purpose of this plot is to visually compare the selected countries on these three variables.
Hover over each circle to see the three values plotted for that country.
# set theme back to default
sns.set_theme()
# bubble chart
fig = px.scatter(b2015, x = "Life Expectancy", y = "Life Ladder", color = 'Country',
size = 'Population', size_max = 60, text = 'Country',
title='Happiness and Population in 2015',
color_discrete_sequence = px.colors.qualitative.Bold)
fig.update_layout(xaxis_title = 'Life Expectancy',
yaxis_title = 'Life Ladder');
fig.show();
# TODO: Create a subset of 10 countries using only 2019 data
year2019 = merged[merged['Year'] == '2019']
b2019 = year2019.sort_values(by = ['Log GDP'], ascending = False)[:10]
subset2019 = b2019[['Country', 'Log GDP', 'Life Expectancy', 'Corruption']]
subset2019
| | Country | Log GDP | Life Expectancy | Corruption |
|---|---|---|---|---|
| 304 | Luxembourg | 11.665803 | 71.599998 | 0.389598 |
| 471 | Singapore | 11.496914 | 73.599998 | 0.069620 |
| 249 | Ireland | 11.369633 | 71.099998 | 0.372804 |
| 502 | Switzerland | 11.169651 | 72.500000 | 0.293701 |
| 408 | Norway | 11.073689 | 71.400002 | 0.270572 |
| 546 | United States | 11.045013 | 66.099998 | 0.706716 |
| 141 | Denmark | 10.953639 | 71.000000 | 0.174151 |
| 378 | Netherlands | 10.947011 | 71.400002 | 0.360068 |
| 29 | Austria | 10.930130 | 70.900002 | 0.457089 |
| 193 | Germany | 10.895435 | 70.900002 | 0.462255 |
# TODO: Create bubble chart
fig = px.scatter(subset2019, x = 'Life Expectancy', y = "Log GDP", color = 'Country',
size = 'Corruption', size_max = 60, text = 'Country',
title='Life Expectancy, Log GDP, and Corruption in 2019',
color_discrete_sequence = px.colors.qualitative.Bold)
fig.update_layout(xaxis_title = 'Life Expectancy',
yaxis_title = 'Log GDP');
fig.show();
The analysis focuses on the ten countries with the highest Log GDP in 2019 to explore the relationship between Corruption and Life Expectancy within economically prosperous nations.
The three variables, Log GDP (an economic indicator), Corruption (a governance indicator), and Life Expectancy (a health indicator), were chosen to examine how wealth and governance might impact public health.
Among the selected countries, Singapore stands out with the lowest level of corruption and the highest life expectancy, whereas the United States exhibits the highest corruption level and the lowest life expectancy. The remaining eight countries, all European, display similar figures for Log GDP, Corruption, and life expectancy, indicating a relatively consistent pattern within the European context.
Very often when you start working with data, some insight may appear that was not part of your original questions. In this case as we look at the top 20 countries based on Life Ladder, there are quite a few from Europe. Could the level of happiness vary by Continent?
To add a new Continent field to our data, we will use a new library, pycountry_convert, along with functions, map and a data dictionary.
# pip install pycountry-convert
from pycountry_convert import country_name_to_country_alpha2, country_alpha2_to_continent_code
map and a data dictionary for ease of readability when using in visuals
Note that without the availability of the pycountry_convert library, we would have had to create our own translation of Country to Continent, and that would have been a lot of coding!
# function to use the country to get a two character country code
def country_code(country):
    try:
        c_code = country_name_to_country_alpha2(country)
    except Exception:
        # name not found in the lookup: report it and store a missing code
        print('Not in lookup: ', country)
        c_code = None
    return c_code
# call function and save country code into our dataframe
merged['Country_code'] = merged['Country'].map(country_code)
merged.sample(5)
| | Country | Year | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population | Country_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 209 | Guinea | 2015 | 3.504694 | 7.643931 | 0.578860 | 51.500000 | 0.665953 | 0.006996 | 0.762152 | 0.658104 | 0.267741 | 0.610288 | 11432096 | GN |
| 414 | Panama | 2015 | 6.605550 | 10.255235 | 0.882615 | 68.300003 | 0.846669 | -0.009005 | 0.809943 | 0.777305 | 0.263826 | 0.375585 | 3968490 | PA |
| 324 | Malta | 2016 | 6.590842 | 10.609963 | 0.930369 | 71.349998 | 0.916024 | 0.342254 | 0.696495 | 0.644832 | 0.355444 | 0.618818 | 455356 | MT |
| 450 | Rwanda | 2018 | 3.561047 | 7.644321 | 0.616173 | 59.875000 | 0.924232 | 0.057375 | 0.163810 | 0.765132 | 0.308199 | 0.988120 | 12301969 | RW |
| 513 | Thailand | 2017 | 5.938895 | 9.765544 | 0.877269 | 68.150002 | 0.922897 | 0.210881 | 0.883817 | 0.775898 | 0.231598 | 0.605079 | 69209817 | TH |
# function to use two character country code to get the continent code
def continent(c_code):
    try:
        cont = country_alpha2_to_continent_code(c_code)
    except Exception:
        print('Not in lookup: ', c_code)
        cont = 'None'
    return cont
# call function and store into our dataframe
merged['Continent_code'] = merged['Country_code'].map(continent)
merged.sample(5)
| | Country | Year | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population | Country_code | Continent_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 105 | Chad | 2019 | 4.250799 | 7.364944 | 0.640452 | 52.000000 | 0.537246 | 0.055652 | 0.832283 | 0.556211 | 0.460061 | 0.571986 | 15946882 | TD | AF |
| 486 | Spain | 2017 | 6.230173 | 10.584788 | 0.903158 | 71.849998 | 0.755561 | -0.034348 | 0.791269 | 0.601179 | 0.302388 | 0.269586 | 46593236 | ES | EU |
| 100 | Central African Republic | 2017 | 3.475862 | 6.816520 | 0.319589 | 45.299999 | 0.645252 | 0.073952 | 0.889566 | 0.602205 | 0.599335 | 0.650285 | 4596023 | CF | AF |
| 441 | Portugal | 2018 | 5.919823 | 10.435313 | 0.887113 | 70.875000 | 0.877404 | -0.263654 | 0.879728 | 0.645732 | 0.317995 | 0.520631 | 10283822 | PT | EU |
| 189 | Germany | 2015 | 7.037138 | 10.842699 | 0.925923 | 70.099998 | 0.889429 | 0.175081 | 0.412168 | 0.722385 | 0.202705 | 0.628004 | 81686611 | DE | EU |
# look at counts per continent
merged['Continent_code'].value_counts()
EU    182
AF    156
AS    117
NA     57
SA     44
OC     10
Name: Continent_code, dtype: int64
# translate continent code to continent label
contMap = {'EU':'Europe',
'AF':'Africa',
'AS':'Asia',
'NA':'North America',
'SA':'South America',
'OC':'Oceana'}
merged['Continent'] = merged['Continent_code'].map(contMap)
# double check mapping of names
merged['Continent'].value_counts()
Europe           182
Africa           156
Asia             117
North America     57
South America     44
Oceana            10
Name: Continent, dtype: int64
The table below shows the mean value of each Happiness measure by continent.
merged.groupby('Continent').mean(numeric_only = True)
| Continent | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Africa | 4.341149 | 8.104643 | 0.696419 | 55.270696 | 0.730466 | -0.011639 | 0.781105 | 0.654864 | 0.328178 | 0.595115 | 2.648490e+07 |
| Asia | 5.198648 | 9.329025 | 0.785495 | 64.200641 | 0.784404 | 0.063934 | 0.735630 | 0.612498 | 0.287344 | 0.590419 | 1.092152e+08 |
| Europe | 6.280222 | 10.428544 | 0.895217 | 69.060303 | 0.790161 | -0.028179 | 0.674871 | 0.644895 | 0.243844 | 0.415891 | 1.621403e+07 |
| North America | 6.181226 | 9.572108 | 0.843899 | 65.164035 | 0.826778 | -0.004400 | 0.742341 | 0.767733 | 0.283242 | 0.404504 | 4.861473e+07 |
| Oceana | 7.277362 | 10.718973 | 0.949862 | 70.400000 | 0.926028 | 0.229548 | 0.312776 | 0.753047 | 0.195882 | 0.543951 | 1.469741e+07 |
| South America | 6.092384 | 9.627538 | 0.869582 | 67.113636 | 0.831789 | -0.097008 | 0.809598 | 0.758493 | 0.319523 | 0.321939 | 4.404396e+07 |
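When only one measure is of interest, the grouped means can be narrowed and sorted so the ordering of continents is easier to read. The sketch below uses a toy DataFrame with made-up values; the `count` column shows how many rows each mean is based on, which matters when a group is small.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'Continent': ['Europe', 'Europe', 'Africa', 'Africa', 'Asia'],
    'Life Ladder': [7.0, 6.0, 4.0, 5.0, 5.5],
})

# Mean and row count per continent, highest mean first
summary = (toy.groupby('Continent')['Life Ladder']
              .agg(['mean', 'count'])
              .sort_values('mean', ascending=False))
print(summary)
```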
# TODO: Create seaborn pairplots with color based on the instructions above
cols = ['Life Ladder', 'Log GDP', 'Life Expectancy', 'Choice Freedom', 'Corruption', 'Continent']
sns.pairplot(merged[cols], hue='Continent')
<seaborn.axisgrid.PairGrid at 0x20896adbb80>
Life Ladder appears to be positively associated with Log GDP, Life Expectancy, and Choice Freedom. European countries tend to score high in these areas, with the exception of Corruption where a negative correlation is observed, indicating lower levels of corruption associated with better outcomes in the other metrics. Conversely, countries in Africa exhibit lower performance on these indicators, suggesting challenges in economic prosperity, life span, and freedom of choice. Corruption seems to be more prevalent in the African countries.
Module 5 and Module 6 homework code can be used as examples for your own EDA phase 1 and phase 2 homework. After each phase you should be writing out your data so that it can be read into the next phase. After EDA 2, you will be writing an Executive Summary in Jupyter Notebook where you should do no data cleaning, but simply read in your final file and provide your analysis.
merged.to_csv('Happiness_final.csv', header = True, index = False)
The code shows the following:
We will take the 10 largest countries in regard to Population in 2015. Notice that we are using index and .loc to get all columns of data and storing this into a new dataframe called largest.
index = year2015['Population'].nlargest(10).index
index
largest = merged.loc[index]
largest
Int64Index([231, 542, 236, 71, 409, 394, 35, 262, 337, 428], dtype='int64')
| | Country | Year | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population | Country_code | Continent_code | Continent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | India | 2015 | 4.342079 | 8.606009 | 0.610133 | 59.099998 | 0.777225 | -0.005791 | 0.776435 | 0.657201 | 0.321829 | 0.694717 | 1310152392 | IN | AS | Asia |
| 542 | United States | 2015 | 6.863947 | 10.977470 | 0.903571 | 66.599998 | 0.848753 | 0.216716 | 0.697543 | 0.768671 | 0.274688 | 0.346936 | 320738994 | US | NA | North America |
| 236 | Indonesia | 2015 | 5.042800 | 9.225190 | 0.809478 | 62.299999 | 0.779418 | 0.470236 | 0.945967 | 0.796219 | 0.274292 | 0.645185 | 258383257 | ID | AS | Asia |
| 71 | Brazil | 2015 | 6.546897 | 9.620074 | 0.906693 | 64.699997 | 0.798935 | -0.017049 | 0.771339 | 0.687064 | 0.324699 | 0.198535 | 204471759 | BR | SA | South America |
| 409 | Pakistan | 2015 | 4.823195 | 8.361321 | 0.561720 | 55.799999 | 0.586546 | 0.085119 | 0.716641 | 0.469452 | 0.328647 | 0.459588 | 199426953 | PK | AS | Asia |
| 394 | Nigeria | 2015 | 4.932915 | 8.615186 | 0.811648 | 53.099998 | 0.680470 | -0.035880 | 0.926109 | 0.714879 | 0.251190 | 0.410358 | 181137454 | NG | AF | Africa |
| 35 | Bangladesh | 2015 | 4.633474 | 8.216118 | 0.601468 | 63.799999 | 0.814796 | -0.068596 | 0.720601 | 0.543084 | 0.225754 | 0.760612 | 156256287 | BD | AS | Asia |
| 262 | Japan | 2015 | 5.879684 | 10.606649 | 0.922657 | 73.599998 | 0.831694 | -0.158819 | 0.654443 | 0.702269 | 0.176409 | 0.352867 | 127141000 | JP | AS | Asia |
| 337 | Mexico | 2015 | 6.236287 | 9.866248 | 0.760614 | 65.800003 | 0.719466 | -0.153377 | 0.707972 | 0.706145 | 0.237188 | 0.256193 | 121858251 | MX | NA | North America |
| 428 | Philippines | 2015 | 5.547489 | 8.895648 | 0.853589 | 61.900002 | 0.911534 | -0.051945 | 0.755192 | 0.796322 | 0.350588 | 0.668414 | 102113206 | PH | AS | Asia |
Since we plan to plot for all the years in the data, we need to get all the data for our top ten countries from the merged dataframe.
comparelist = list(largest['Country'])
comparelist
compare = merged[merged['Country'].isin(comparelist)]
['India', 'United States', 'Indonesia', 'Brazil', 'Pakistan', 'Nigeria', 'Bangladesh', 'Japan', 'Mexico', 'Philippines']
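As a minimal sketch of how the `isin` filter above works, the toy dataframe below (hypothetical values, not taken from Happiness_clean.csv) keeps every row whose Country appears in a list, across all years. The names `toy`, `keep`, `mask`, and `subset` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the merged dataframe
toy = pd.DataFrame({
    "Country": ["India", "India", "Malta", "Japan", "Malta"],
    "Year": ["2015", "2016", "2015", "2015", "2016"],
    "Choice Freedom": [0.78, 0.82, 0.91, 0.83, 0.92],
})

keep = ["India", "Japan"]

# isin returns a boolean Series: True where Country is in the list
mask = toy["Country"].isin(keep)

# boolean indexing keeps only the rows where the mask is True
subset = toy[mask]
print(subset)
```

Because the mask is built on Country only, every year for a matching country is kept, which is what the line plot across years needs.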
This plot shows the Choice Freedom (freedom of choice) happiness category for our ten countries from 2015 to 2019.
fig = px.line(compare, x = "Year", y = "Choice Freedom", color='Country',
title='Freedom of Choice ', color_discrete_sequence = px.colors.qualitative.Bold)
fig.update_layout(height = 600, xaxis_title = 'Year')
There will be times when you will want to look at sections of your data to see if there are differences between groupings. One way to do this is to rank the data and then create groups from the ranks. This is different from clustering because you are choosing the ranking variable and the cutoffs for the groups.
# create a copy of 2015 data
year2015copy = year2015.copy()
When you look at five sample rows of data after the ranking, you can see a new column called popRank that has given the most populous country in 2015 a value of 1 and the least populous a value of 113.
year2015copy = year2015copy.reset_index()
year2015copy['popRank'] = year2015copy['Population'].rank(ascending = False)
year2015copy.sample(5)
| | index | Country | Year | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population | popRank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51 | 250 | Israel | 2015 | 7.079411 | 10.526913 | 0.864130 | 71.800003 | 0.752784 | 0.106784 | 0.789430 | 0.651632 | 0.256258 | 0.405343 | 8380100 | 73.0 |
| 25 | 123 | Croatia | 2015 | 5.205438 | 10.122007 | 0.768363 | 67.900002 | 0.693523 | -0.099673 | 0.848546 | 0.570057 | 0.294019 | 0.364868 | 4203604 | 91.0 |
| 23 | 111 | Colombia | 2015 | 6.387572 | 9.553629 | 0.889900 | 68.300003 | 0.790898 | -0.100814 | 0.842899 | 0.803392 | 0.291769 | 0.271787 | 47520667 | 21.0 |
| 108 | 542 | United States | 2015 | 6.863947 | 10.977470 | 0.903571 | 66.599998 | 0.848753 | 0.216716 | 0.697543 | 0.768671 | 0.274688 | 0.346936 | 320738994 | 2.0 |
| 81 | 409 | Pakistan | 2015 | 4.823195 | 8.361321 | 0.561720 | 55.799999 | 0.586546 | 0.085119 | 0.716641 | 0.469452 | 0.328647 | 0.459588 | 199426953 | 5.0 |
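The popRank values above are floats (73.0, 91.0, ...) because `rank` averages the ranks of tied values by default. The sketch below uses a small made-up Series (the populations and labels are hypothetical) to show how `ascending = False` and the `method` argument behave when two values tie.

```python
import pandas as pd

# Hypothetical populations; B and C are tied on purpose
pop = pd.Series([500, 120, 120, 30], index=["A", "B", "C", "D"])

# Largest value gets rank 1; ties share the average of their ranks (default)
rank_avg = pop.rank(ascending=False)

# method="min": tied values all receive the lowest rank in the tie
rank_min = pop.rank(ascending=False, method="min")

# method="first": ties are broken by order of appearance
rank_first = pop.rank(ascending=False, method="first")

print(pd.DataFrame({"avg": rank_avg, "min": rank_min, "first": rank_first}))
```

If no two countries share the same population, all three methods give the same whole-number ranks.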
When creating categories, the number of categories and the breakdown of each category will depend on the data. For example purposes, this data will be broken into three categories with the splits based on the popRank column.
from pandas import Categorical
# create categorical variable for rankings - divide in thirds
top_cat = 37 # gives us top third of rankings based on 113 countries
low_cat = 75 # gives us bottom third of rankings
# assign a category of 1, 2, or 3 based on how popRank compares with these cutoff values
year2015copy['popCat'] = Categorical(np.where(year2015copy['popRank'] <= top_cat,1,2))
year2015copy['popCat'] = Categorical(np.where(year2015copy['popRank'] >= low_cat,3,year2015copy['popCat']))
year2015copy['popCat'].value_counts()
3    39
1    37
2    37
Name: popCat, dtype: int64
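As an alternative sketch (not the approach used in this notebook), the same three categories can be assigned in one step with `np.select`. The example assumes ranks from 1 to 113 with no ties and reuses the cutoffs 37 and 75; the variable names `ranks`, `cat`, and `counts` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for popRank: ranks 1..113 with no ties
ranks = pd.Series(np.arange(1, 114, dtype=float))

top_cat = 37  # ranks 1-37 go to category 1
low_cat = 75  # ranks 75-113 go to category 3

# Conditions are checked in order; anything matching neither gets the default
cat = np.select([ranks <= top_cat, ranks >= low_cat], [1, 3], default=2)
pop_cat = pd.Categorical(cat)

# Count how many ranks fall into each category
counts = pd.Series(cat).value_counts().sort_index()
print(counts)
```

With these cutoffs the groups are not exactly equal: ranks 75 through 113 cover 39 values, while the other two groups hold 37 each.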
temp = year2015copy[['Country','Population','popRank']]
temp.sort_values(by = ['popRank'])
| | Country | Population | popRank |
|---|---|---|---|
| 47 | India | 1310152392 | 1.0 |
| 108 | United States | 320738994 | 2.0 |
| 48 | Indonesia | 258383257 | 3.0 |
| 15 | Brazil | 204471759 | 4.0 |
| 81 | Pakistan | 199426953 | 5.0 |
| ... | ... | ... | ... |
| 11 | Bhutan | 727885 | 109.0 |
| 70 | Montenegro | 622159 | 110.0 |
| 60 | Luxembourg | 569604 | 111.0 |
| 65 | Malta | 445053 | 112.0 |
| 46 | Iceland | 330815 | 113.0 |
113 rows × 3 columns
Using groupby we can see the mean values for each category.
Keep in mind that Group 1 is the most populous and Group 3 is the least populous.
# how do our groupings differ?
year2015copy.groupby('popCat').mean(numeric_only = True)
| popCat | index | Life Ladder | Log GDP | Social support | Life Expectancy | Choice Freedom | Generosity | Corruption | Positive affect | Negative affect | Government confidence | Population | popRank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 307.027027 | 5.410703 | 9.241387 | 0.807442 | 63.035135 | 0.768396 | 0.080246 | 0.752726 | 0.679936 | 0.271805 | 0.488665 | 1.140240e+08 | 19.0 |
| 2 | 263.189189 | 5.333634 | 9.096096 | 0.778766 | 61.875676 | 0.757791 | -0.010320 | 0.735873 | 0.677207 | 0.270880 | 0.528884 | 1.345873e+07 | 56.0 |
| 3 | 270.615385 | 5.612790 | 9.797522 | 0.823479 | 65.533333 | 0.753242 | 0.009153 | 0.707278 | 0.654233 | 0.276783 | 0.433603 | 3.788129e+06 | 94.0 |
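A group mean can be pulled upward by a few very large values, as with Population in Group 1. The sketch below, on hypothetical toy data, uses named aggregation to show the group size, mean, and median side by side; the column names `n`, `pop_mean`, `pop_median`, and `ladder_mean` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; group 1 contains one very large population
toy = pd.DataFrame({
    "popCat": [1, 1, 1, 2, 2, 3, 3],
    "Population": [1000, 200, 100, 50, 40, 5, 3],
    "Life Ladder": [4.0, 6.0, 5.0, 5.0, 5.5, 6.5, 7.0],
})

# Named aggregation: new_column = (source_column, aggregation)
summary = toy.groupby("popCat").agg(
    n=("Population", "size"),
    pop_mean=("Population", "mean"),
    pop_median=("Population", "median"),
    ladder_mean=("Life Ladder", "mean"),
)
print(summary)
```

Comparing the mean with the median for each group is one quick way to see whether a single large value is driving the average.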
Even though there appear to be differences between the three categories when looking at the means, the plots below do not show any clear patterns.
# pairplots on four fields
#columns = year2015copy.columns[1:5]
#columns
sns.pairplot(year2015copy, hue = 'popCat', kind = 'scatter', corner = True,
vars = ['Life Ladder', 'Log GDP', 'Corruption', 'Life Expectancy'])